[Transform] Attention/Cache transforms #436

kylesayrs · 2025-08-26T23:46:45Z

Purpose

- Support fully-expressive attention and kv cache quantization
- Support running kv cache quantization evals with hf transformers

Prerequisites

Must be merged at the same time as [Quantization] Attention/ KV Cache Refactor llm-compressor#1651

Changes

New Classes

Add hookable attention and kvcache implementations, which are registered to the attention module as submodules
- QuantizedAttentionImpl injects itself into the model by registering a new attention implementation called ct_hooked_attention and overriding model.config._attn_implementation to be the new implementation name
- QuantizedKVCache injects itself into the model by overriding the past_key_values input kwarg to attention and wrapping the functionality of the original cache
- Calibration and transform hooks can be added to these modules via the hook functions
  - register_query_hook
  - register_key_hook
  - register_value_hook
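
To make the hook mechanism concrete, here is a minimal, framework-free sketch of a cache wrapper with key and value hooks. `ToyCache` and `HookedCache` are hypothetical stand-ins, not the classes added by this PR. Only the names `register_key_hook` and `register_value_hook` come from the description above, and their real signatures may differ.

```python
from typing import Callable, List, Optional, Tuple

Vector = List[float]
Hook = Callable[[Vector], Optional[Vector]]


class ToyCache:
    """Hypothetical stand-in for the model's original KV cache."""

    def __init__(self) -> None:
        self.keys: List[Vector] = []
        self.values: List[Vector] = []

    def update(self, key: Vector, value: Vector) -> Tuple[List[Vector], List[Vector]]:
        self.keys.append(key)
        self.values.append(value)
        return self.keys, self.values


class HookedCache:
    """Wraps an existing cache and runs hooks on keys/values before storing them."""

    def __init__(self, wrapped: ToyCache) -> None:
        self.wrapped = wrapped
        self._key_hooks: List[Hook] = []
        self._value_hooks: List[Hook] = []

    def register_key_hook(self, hook: Hook) -> None:
        self._key_hooks.append(hook)

    def register_value_hook(self, hook: Hook) -> None:
        self._value_hooks.append(hook)

    @staticmethod
    def _run(hooks: List[Hook], state: Vector) -> Vector:
        for hook in hooks:
            result = hook(state)
            # A hook may only observe (return None) or transform (return a new state).
            if result is not None:
                state = result
        return state

    def update(self, key: Vector, value: Vector) -> Tuple[List[Vector], List[Vector]]:
        key = self._run(self._key_hooks, key)
        value = self._run(self._value_hooks, value)
        # Delegate storage to the original cache.
        return self.wrapped.update(key, value)


observed_max: List[float] = []
cache = HookedCache(ToyCache())
# Calibration-style hook: record a statistic and leave the keys untouched.
cache.register_key_hook(lambda k: observed_max.append(max(abs(x) for x in k)))
# Transform-style hook: return modified values (here, rounded to a 0.5 grid).
cache.register_value_hook(lambda v: [round(x * 2) / 2 for x in v])

keys, values = cache.update([0.1, -0.7], [0.26, 1.1])
```

The same pattern covers both use cases named above: a calibration hook observes and returns nothing, while a transform hook returns the state that should replace its input.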

Quantization Lifecycle Changes

Apply
- The kv_cache_scheme field of the quantization config is now used to call initialize_hooked_kv_cache
- Attention modules can now be targeted, and are used to call initialize_hooked_attention if attention modules are explicitly targeted (see is_narrow_match)
- Remove logic for "merging" kv cache schemes (this doesn't really make any sense, I'm not sure why it was ever included)
Initialize
- Hooked kv cache and attention modules have their quantization parameters initialized by initialize_module_for_quantization
- The presence of attention or kvcache submodules determines whether attention quantization or kv-cache-only quantization is applied
Serialization
- QuantizationConfig.from_pretrained was cleaned up with additional comments
- The kv_cache_scheme field is added if there are any attention modules with a quantization_scheme attached
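
The Apply step distinguishes targets that name an attention module directly from targets that merely sweep it up along with everything else in a layer. Below is a minimal sketch of that idea, assuming a "narrow" match means the target matches the module but not the module's parent. `matches` and `narrow_match` are hypothetical helpers that operate on dotted module names; they are not the library's `is_narrow_match`, whose signature is not shown in this description.

```python
import re
from typing import Iterable


def matches(target: str, name: str) -> bool:
    """Match a dotted module name against a target.

    Assumed convention for this sketch: targets prefixed with "re:" are
    regular expressions; anything else must equal the module name exactly.
    """
    if target.startswith("re:"):
        return re.match(target[3:], name) is not None
    return target == name


def narrow_match(targets: Iterable[str], name: str) -> bool:
    """True if some target matches `name` but does not also match its parent."""
    parent = name.rsplit(".", 1)[0] if "." in name else ""
    return any(matches(t, name) and not matches(t, parent) for t in targets)


attn = "model.layers.0.self_attn"

# Targets the attention module itself: narrow.
specific = narrow_match(["re:.*self_attn$"], attn)
# Matches every module under the layer, including the parent: not narrow.
broad = narrow_match(["re:model\\.layers\\.0.*"], attn)
```

Under this reading, only the first target would trigger attention-specific initialization, because the second would equally apply to the whole decoder layer.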

Helpers

is_narrow_match checks that attention modules are being specifically targeted (rather than targeting all modules in a layer)
get_num_attn_heads, get_num_kv_heads, and get_head_dim read attention values from the model config
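
A rough sketch of what such config helpers might look like, assuming a Hugging Face style config that exposes `num_attention_heads` and `hidden_size`, and optionally `num_key_value_heads` and `head_dim`. The fallback rules are assumptions made for illustration and may not match the helpers added in this PR.

```python
from types import SimpleNamespace


def get_num_attn_heads(config) -> int:
    return config.num_attention_heads


def get_num_kv_heads(config) -> int:
    # Assumption: configs without grouped-query attention may omit this field,
    # in which case every attention head has its own key/value head.
    num_kv_heads = getattr(config, "num_key_value_heads", None)
    return num_kv_heads if num_kv_heads is not None else get_num_attn_heads(config)


def get_head_dim(config) -> int:
    # Assumption: fall back to hidden_size // num_heads when head_dim is absent.
    head_dim = getattr(config, "head_dim", None)
    if head_dim is not None:
        return head_dim
    return config.hidden_size // get_num_attn_heads(config)


# Illustrative values only; not read from any particular checkpoint.
config = SimpleNamespace(
    hidden_size=2048,
    num_attention_heads=32,
    num_key_value_heads=8,
)

summary = (get_num_attn_heads(config), get_num_kv_heads(config), get_head_dim(config))
```

Centralizing these lookups keeps per-head quantization and transform code from re-deriving shapes in several places.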

Testing

Added tests for is_narrow_match
Added tests for added attention and kvcache classes
Quantized models
- kylesayrs/Llama-3.2-1B-Instruct-attention-fp8-head
- kylesayrs/Llama-3.2-1B-Instruct-attention-nvfp4-head

Evaluation

eval.py

import sys
import lm_eval

model_id = sys.argv[1]

print(model_id)
results = lm_eval.simple_evaluate(
    # 3) hf serialized
    model="hf",
    model_args={
        "pretrained": model_id,
        "add_bos_token": False,
        "dtype": "auto",
        "device_map": "cuda",
        #"max_length": 128000,
    },
    device="cuda",
    # 3/)

    #tasks=["gsm8k_platinum", "mmlu_llama", "longbench2_single"],
    tasks=["gsm8k_platinum"],
    batch_size=64,
    apply_chat_template=True,
    fewshot_as_multiturn=True,
)
print(model_id)
print(lm_eval.utils.make_table(results))

compress.py

from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.utils import dispatch_for_generation
from compressed_tensors.quantization import QuantizationScheme, QuantizationArgs

# Select model and load it.
#model_id = "Qwen/Qwen2.5-14B-Instruct-1M"
model_id = "meta-llama/Llama-3.1-8B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Select calibration dataset.
DATASET_ID = "ultrachat_200k"
DATASET_SPLIT = "train_sft"

# Select number of samples. 512 samples is a good place to start.
# Increasing the number of samples can improve accuracy.
NUM_CALIBRATION_SAMPLES = 512
MAX_SEQUENCE_LENGTH = 2048

# Configure the quantization algorithm to run.
args = QuantizationArgs(
    num_bits=8,
    type="float",
    strategy="attn_head",
    symmetric=True,
    observer="static_minmax",
)
recipe = QuantizationModifier(
    # config_groups={
    #     "attention": QuantizationScheme(
    #         #targets=["Qwen2Attention"],
    #         targets=["LlamaAttention"],
    #         input_activations=args,
    #     )
    # }
    kv_cache_scheme=args,
)

# Apply algorithms.
oneshot(
    model=model,
    dataset=DATASET_ID,
    splits={"calibration": f"{DATASET_SPLIT}[:{NUM_CALIBRATION_SAMPLES}]"},
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
)

# Confirm generations of the quantized model look sane.
print("\n\n")
print("========== SAMPLE GENERATION ==============")
dispatch_for_generation(model)
sample = tokenizer("Hello my name is", return_tensors="pt")
sample = {key: value.to(model.device) for key, value in sample.items()}
output = model.generate(**sample, max_new_tokens=100)
print(tokenizer.decode(output[0]))
print("==========================================\n\n")

# Save to disk compressed.
SAVE_DIR = model_id.rstrip("/").split("/")[-1] + f"-KV-FP8-{args.strategy}-{args.observer}"
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
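
For intuition about what `strategy="attn_head"` changes relative to a per-tensor strategy, the sketch below computes symmetric min-max scales for a toy activation of shape (heads, tokens, head_dim) using plain Python lists. It assumes the FP8 E4M3 format, whose largest finite value is 448, and that a scale is simply absmax / 448. The observers used by the script may compute and store scales differently.

```python
from typing import List

FP8_E4M3_MAX = 448.0  # largest finite magnitude representable in FP8 E4M3

Head = List[List[float]]  # tokens x head_dim


def absmax(head: Head) -> float:
    """Largest absolute value across all tokens and channels of one head."""
    return max(abs(x) for token in head for x in token)


def per_head_scales(states: List[Head]) -> List[float]:
    """One symmetric scale per attention head."""
    return [absmax(head) / FP8_E4M3_MAX for head in states]


def per_tensor_scale(states: List[Head]) -> float:
    """A single symmetric scale shared by every head."""
    return max(absmax(head) for head in states) / FP8_E4M3_MAX


# Toy activation: 2 heads, 2 tokens, head_dim 2.
states = [
    [[0.5, -1.0], [0.25, 0.75]],   # head 0: small dynamic range
    [[8.0, -2.0], [-16.0, 4.0]],   # head 1: large dynamic range
]

head_scales = per_head_scales(states)
tensor_scale = per_tensor_scale(states)
```

In this toy case head 0 gets a scale 16 times finer than the shared per-tensor scale, which is the kind of resolution gain a per-head strategy is meant to offer when heads differ in range.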

Model | GSM8K
-- | --
nm-testing/Llama-3.1-8B-Instruct | 0.8337
nm-testing/Llama-3.1-8B-Instruct-KV-FP8-Tensor | 0.8271
nm-testing/Llama-3.1-8B-Instruct-KV-FP8-Head | 0.8354
nm-testing/Llama-3.1-8B-Instruct-QKV-FP8-Tensor | 0.8321
nm-testing/Llama-3.1-8B-Instruct-QKV-FP8-Head | 0.8238
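
To read the table more easily, the scores can be expressed as differences from the unquantized baseline. The snippet below only rearranges the numbers reported above; it adds no new measurements.

```python
baseline = 0.8337  # nm-testing/Llama-3.1-8B-Instruct
scores = {
    "KV-FP8-Tensor": 0.8271,
    "KV-FP8-Head": 0.8354,
    "QKV-FP8-Tensor": 0.8321,
    "QKV-FP8-Head": 0.8238,
}

# Difference from the baseline score, rounded to the table's precision.
deltas = {name: round(score - baseline, 4) for name, score in scores.items()}

for name, delta in deltas.items():
    print(f"{name:<16} {delta:+.4f}")
```

All four quantized variants land within 0.01 of the baseline score on this task.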

brian-dellabetta

This looks good, though I have a number of questions and minor suggestions

src/compressed_tensors/modeling/attention.py

src/compressed_tensors/modeling/kvcache.py

dsikka

If the goal is to use this generally for kv_cache and attn quantize, can we move the initialize_hooked_attention and initialize_hooked_kv_cache to initialize.py?

I understand we haven't hooked them in yet for those workflows but I think these belong there.

src/compressed_tensors/modeling/attention.py

dsikka

do a pass through on any missing docstring, otherwise lgtm.
nice work

src/compressed_tensors/modeling/kvcache.py

The base branch was changed.

brian-dellabetta

Following for the most part. A few clarifications, but this makes sense to me

src/compressed_tensors/modeling/attention.py

The base branch was changed.

kylesayrs · 2025-10-13T20:31:30Z

https://github.com/neuralmagic/llm-compressor-testing/actions/runs/18477528039

kylesayrs · 2025-10-14T02:43:09Z

Last nightly worked, but e2e failed due to model storage issues
https://github.com/neuralmagic/llm-compressor-testing/actions/runs/18483826999

brian-dellabetta

We can resolve the global var thread. I have another new comment we might want to consider in a follow-up, but I'm marking this as approved. Cool stuff! Excited to see it in action

src/compressed_tensors/modeling/attention.py

src/compressed_tensors/quantization/quant_config.py

dsikka

Just some questions. Otherwise, LGTM

src/compressed_tensors/quantization/lifecycle/initialize.py

src/compressed_tensors/modeling/kvcache.py

src/compressed_tensors/quantization/quant_config.py

dsikka

For the sake of completeness, do you mind adding your kv_cache and attn quantized sample models to this PR description?

src/compressed_tensors/modeling/kvcache.py

Signed-off-by: Kyle Sayers <[email protected]>

kylesayrs · 2025-10-20T20:03:07Z

https://github.com/neuralmagic/llm-compressor-testing/actions/runs/18663440870

brian-dellabetta

impressive work!

Signed-off-by: Kyle Sayers <[email protected]>

## Purpose ##
* Support fully-expressive attention and kv cache quantization
* Support running kv cache quantization evals with hf transformers
* Resolves #1949
* Resolves #1928

```python3
recipe = QuantizationModifier(
    config_groups={
        "attention": QuantizationScheme(
            targets=["LlamaAttention"],
            input_activations=QuantizationArgs(
                num_bits=8, type="float", strategy="tensor"
            ),
        )
    }
)
```

```json
{
  "quantization_config": {
    "config_groups": {
      "group_0": {
        "format": null,
        "input_activations": {
          "dynamic": false,
          "num_bits": 8,
          "observer": "minmax",
          "strategy": "tensor",
          "symmetric": true,
          "type": "float"
        },
        "output_activations": null,
        "targets": [
          "LlamaAttention"
        ],
        "weights": null
      }
    },
    "format": "dense",
    "ignore": [],
    "kv_cache_scheme": {
      "dynamic": false,
      "group_size": null,
      "num_bits": 8,
      "observer": "minmax",
      "strategy": "tensor",
      "symmetric": true,
      "type": "float"
    },
    "quant_method": "compressed-tensors",
    "quantization_status": "frozen"
  }
}
```

## Prerequisites ##
* Must be merged at the same time as vllm-project/compressed-tensors#436

## Changes ##
* Replace hooks
  * Remove `calibrate_kv_cache_input_hook`, `calibrate_kv_cache_output_hook`, `initialize_quantized_kv_cache`
  * Add `calibrate_query_hook`, `calibrate_key_hook`, `calibrate_value_hook`
* QuantizationMixin now initializes "q", "k", and "v" observers ([depending on the attached submodules](https://github.com/vllm-project/llm-compressor/pull/1651/files#diff-33303ae48e185b2fbb14dc45c2052805837deb5723248367b9579321c4c4e974R263-R270)) and adds the appropriate hooks
* Miscellaneous
  * Fix minor shape bug in `_flatten_attention`
  * Add support for "attn_head" strategy in `_flatten_attention`
* Tests
  * Removed old QuantizationKVCache tests (these classes are now tested [here](https://github.com/neuralmagic/compressed-tensors/pull/436/files#diff-6e33ff48047dc4f7c9d969293f87e32e4d5ec3f3e8b741ea757780c8c0aab775))
  * Updated scale names to avoid using enum
  * Avoid unnecessary tokenization to reduce runtime

## Testing ##
* Kv cache regression tests pass
* Able to quantize attention with scripts (will add to examples once loadable in vllm)
  * kylesayrs/Llama-3.2-1B-Instruct-attention-fp8-head
  * kylesayrs/Llama-3.2-1B-Instruct-attention-nvfp4-head
* Nightly passes (in progress)

## Evaluation ##
<details><summary>eval.py</summary>

```python
import sys
import lm_eval

model_id = sys.argv[1]

print(model_id)
results = lm_eval.simple_evaluate(
    # 3) hf serialized
    model="hf",
    model_args={
        "pretrained": model_id,
        "add_bos_token": False,
        "dtype": "auto",
        "device_map": "cuda",
        #"max_length": 128000,
    },
    device="cuda",
    # 3/)

    #tasks=["gsm8k_platinum", "mmlu_llama", "longbench2_single"],
    tasks=["gsm8k_platinum"],
    batch_size=64,
    apply_chat_template=True,
    fewshot_as_multiturn=True,
)
print(model_id)
print(lm_eval.utils.make_table(results))
```
</details>

<details><summary>compress.py</summary>

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.utils import dispatch_for_generation
from compressed_tensors.quantization import QuantizationScheme, QuantizationArgs

# Select model and load it.
#model_id = "Qwen/Qwen2.5-14B-Instruct-1M"
model_id = "meta-llama/Llama-3.1-8B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Select calibration dataset.
DATASET_ID = "ultrachat_200k"
DATASET_SPLIT = "train_sft"

# Select number of samples. 512 samples is a good place to start.
# Increasing the number of samples can improve accuracy.
NUM_CALIBRATION_SAMPLES = 512
MAX_SEQUENCE_LENGTH = 2048

# Configure the quantization algorithm to run.
args = QuantizationArgs(
    num_bits=8,
    type="float",
    strategy="attn_head",
    symmetric=True,
    observer="static_minmax",
)
recipe = QuantizationModifier(
    # config_groups={
    #     "attention": QuantizationScheme(
    #         #targets=["Qwen2Attention"],
    #         targets=["LlamaAttention"],
    #         input_activations=args,
    #     )
    # }
    kv_cache_scheme=args,
)

# Apply algorithms.
oneshot(
    model=model,
    dataset=DATASET_ID,
    splits={"calibration": f"{DATASET_SPLIT}[:{NUM_CALIBRATION_SAMPLES}]"},
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
)

# Confirm generations of the quantized model look sane.
print("\n\n")
print("========== SAMPLE GENERATION ==============")
dispatch_for_generation(model)
sample = tokenizer("Hello my name is", return_tensors="pt")
sample = {key: value.to(model.device) for key, value in sample.items()}
output = model.generate(**sample, max_new_tokens=100)
print(tokenizer.decode(output[0]))
print("==========================================\n\n")

# Save to disk compressed.
SAVE_DIR = model_id.rstrip("/").split("/")[-1] + f"-KV-FP8-{args.strategy}-{args.observer}"
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
```
</details>

Model | GSM8K
-- | --
nm-testing/Llama-3.1-8B-Instruct | 0.8337
nm-testing/Llama-3.1-8B-Instruct-KV-FP8-Tensor | 0.8271
nm-testing/Llama-3.1-8B-Instruct-KV-FP8-Head | 0.8354
nm-testing/Llama-3.1-8B-Instruct-QKV-FP8-Tensor | 0.8321
nm-testing/Llama-3.1-8B-Instruct-QKV-FP8-Head | 0.8238

---------

Signed-off-by: Kyle Sayers <[email protected]>

kylesayrs mentioned this pull request Aug 27, 2025

[Transform] Spinquant R3 vllm-project/llm-compressor#1778

Open

brian-dellabetta previously approved these changes Aug 27, 2025

dsikka reviewed Aug 28, 2025

src/compressed_tensors/modeling/attention.py (outdated, resolved)

kylesayrs force-pushed the kylesayrs/r3-only branch from 7bf4b57 to 75056bf on August 28, 2025 21:09

dsikka previously approved these changes Sep 2, 2025

src/compressed_tensors/modeling/kvcache.py (outdated, resolved)

Base automatically changed from kylesayrs/transform-simplify-key to main September 8, 2025 18:46

kylesayrs mentioned this pull request Oct 7, 2025

[Attention] Attention head quantization strategy #481

Merged

kylesayrs force-pushed the kylesayrs/r3-only branch 2 times, most recently from e224a5d to 05ec17e on October 8, 2025 19:20

kylesayrs changed the base branch from main to kylesayrs/add-attn-head-strat October 8, 2025 19:20

brian-dellabetta previously approved these changes Oct 8, 2025

src/compressed_tensors/modeling/attention.py (resolved)

src/compressed_tensors/modeling/attention.py (outdated, resolved)

kylesayrs marked this pull request as draft October 8, 2025 21:06

kylesayrs force-pushed the kylesayrs/add-attn-head-strat branch from d084c5e to e3f24d4 on October 9, 2025 14:19

kylesayrs mentioned this pull request Oct 9, 2025

[Transform] [Attention] [KV Cache] Support KV-cache integrated attention transform and quantization #428

Closed

kylesayrs changed the base branch from kylesayrs/add-attn-head-strat to main October 9, 2025 18:14

kylesayrs changed the base branch from main to kylesayrs/add-attn-head-strat October 9, 2025 18:15

kylesayrs mentioned this pull request Oct 9, 2025

[Quantization] Attention/ KV Cache Refactor vllm-project/llm-compressor#1651

Merged

kylesayrs force-pushed the kylesayrs/r3-only branch from 145c9aa to 2efe3db on October 9, 2025 18:35

Base automatically changed from kylesayrs/add-attn-head-strat to main October 9, 2025 20:11

kylesayrs force-pushed the kylesayrs/r3-only branch from 7c19358 to 04f716a on October 9, 2025 20:16

kylesayrs mentioned this pull request Oct 12, 2025

[Transforms] Use get_head_dim util vllm-project/llm-compressor#1918

Closed

kylesayrs marked this pull request as ready for review October 13, 2025 20:41

kylesayrs force-pushed the kylesayrs/r3-only branch 2 times, most recently from 4cc5ace to 9ead292 on October 14, 2025 04:21

kylesayrs mentioned this pull request Oct 14, 2025

[Quantization] Channel wise output activation quantization for QKV Attention layers #270

Closed

brian-dellabetta previously approved these changes Oct 14, 2025

src/compressed_tensors/modeling/attention.py (outdated, resolved)

src/compressed_tensors/quantization/quant_config.py (resolved)

kylesayrs mentioned this pull request Oct 15, 2025

[Bug]: k_scale and v_scale is zero after kv cache fp8 quantization vllm-project/llm-compressor#1928

Closed

kylesayrs dismissed brian-dellabetta’s stale review via dc43b64 October 15, 2025 17:14

brian-dellabetta previously approved these changes Oct 15, 2025

dsikka reviewed Oct 15, 2025

src/compressed_tensors/modeling/kvcache.py (resolved)

kylesayrs dismissed brian-dellabetta’s stale review via 0674268 October 15, 2025 22:07

kylesayrs added 5 commits October 20, 2025 11:05

attention quant

1c9bf45

Signed-off-by: Kyle Sayers <[email protected]>

reduce diff

35acc55

Signed-off-by: Kyle Sayers <[email protected]>

address nits

a9f6e1f

Signed-off-by: Kyle Sayers <[email protected]>

fix kv cache serialization, add tests

311a9ab

Signed-off-by: Kyle Sayers <[email protected]>

fix style

8c99f63

Signed-off-by: Kyle Sayers <[email protected]>

kylesayrs force-pushed the kylesayrs/r3-only branch from eff4729 to 8c99f63 on October 20, 2025 15:05

do not force zp for attention

5225515

Signed-off-by: Kyle Sayers <[email protected]>

brian-dellabetta previously approved these changes Oct 20, 2025

populate ALL_MASK_ATTENTION_FUNCTIONS

a677372

Signed-off-by: Kyle Sayers <[email protected]>

kylesayrs dismissed brian-dellabetta’s stale review via a677372 October 21, 2025 19:49

brian-dellabetta approved these changes Oct 21, 2025

dsikka approved these changes Oct 23, 2025

kylesayrs merged commit e88e7d4 into main Oct 23, 2025
3 checks passed

kylesayrs deleted the kylesayrs/r3-only branch October 23, 2025 14:35

